Data Analysis: Insurance (Python)
The following packages are used in the Python version of the tutorial.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
#read the data into a dataframe
ds0 = pd.read_csv('~/Desktop/506/groupproject/insurance.csv')
#have a glimpse of the data
print(ds0.head())
## age sex bmi children smoker region charges
## 0 19 female 27.900 0 yes southwest 16884.92400
## 1 18 male 33.770 1 no southeast 1725.55230
## 2 28 male 33.000 3 no southeast 4449.46200
## 3 33 male 22.705 0 no northwest 21984.47061
## 4 32 male 28.880 0 no northwest 3866.85520
print(ds0.columns)
## Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
#check whether there are any missing values
ds0.isnull().sum()
## age         0
## sex         0
## bmi         0
## children    0
## smoker      0
## region      0
## charges     0
## dtype: int64
#no missing values :)
#recode the character columns as numeric variables
ds0.sex = ds0.sex.map({'female':1, 'male':0})
ds0.smoker = ds0.smoker.map({'yes':1, 'no':0})
ds_reg = pd.get_dummies(ds0['region'])
# since region is now encoded as dummy variables, we drop one column (southwest) to avoid perfect collinearity
ds0 = ds0.join(ds_reg.iloc[:,0:3])
ds0['logcharges'] = np.log(ds0.charges)
ds1 = ds0.drop(['charges','region'],axis=1)
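A small illustration of why one dummy column must be dropped (hypothetical region values; `pd.get_dummies` sorts the columns alphabetically, so keeping the first three drops southwest, as `ds_reg.iloc[:,0:3]` does above):

```python
import pandas as pd

region = pd.Series(['southwest', 'southeast', 'northwest', 'northeast'])
d = pd.get_dummies(region)
# The four indicators always sum to 1 in every row, so together with an
# intercept they would be perfectly collinear.
assert (d.sum(axis=1) == 1).all()
# Keeping the first three columns drops 'southwest'.
d_indep = d.iloc[:, 0:3]
assert list(d_indep.columns) == ['northeast', 'northwest', 'southeast']
```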
import seaborn as sns
import matplotlib.pyplot as plt
# As we can see, charges has a heavy right tail, so we take its log.
sns.distplot(ds0['charges'], color='b')
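Why the log helps can be sketched with synthetic heavy-tailed data (a lognormal stand-in for charges, not the real dataset): the log transform brings the skewness close to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
charges = rng.lognormal(mean=9, sigma=0.9, size=2000)  # heavy right tail

def skewness(a):
    # sample skewness: third central moment over variance^(3/2)
    c = np.asarray(a, dtype=float) - np.mean(a)
    return np.mean(c**3) / np.mean(c**2) ** 1.5

assert skewness(charges) > 1.0               # strongly right-skewed
assert abs(skewness(np.log(charges))) < 0.3  # roughly symmetric after log
```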
# Data in every other column looks fine right now.
f,ax = plt.subplots(2,3,figsize=(10,8))
sns.distplot(ds1["age"], kde=False, ax=ax[0,0])
sns.boxplot(x='sex',y='charges', data=ds0, ax=ax[0,1])
sns.distplot(ds1['logcharges'], ax=ax[0,2], kde=False, color='b')
# logcharges now looks roughly normally distributed.
sns.distplot(ds1['bmi'],ax=ax[1,0], kde=False, color='b')
sns.countplot('children',data=ds1, ax=ax[1,1])
sns.countplot('region',data=ds0, ax=ax[1,2])
ax[0,0].set_title('Distribution of Ages')
ax[0,1].set_title('Charges boxplot by Sex')
ax[0,2].set_title('Distribution of log charges')
ax[1,0].set_title('Distribution of bmi')
ax[1,1].set_title('Distribution of children')
ax[1,2].set_title('Distribution of regions')
Standardizing the covariates is optional here: it rescales the coefficients but does not change the fit of the regression, so we will not use it in what follows. For reference, a standardizing helper:
def centralize(arra):
    # subtract the mean, divide by the sample standard deviation
    cen = arra - np.mean(arra)
    var = np.sqrt(sum(cen**2)/(len(arra)-1))
    return cen/var
ds0.bmi = centralize(ds0.bmi)
ds0.age = centralize(ds0.age)
# note: this modifies ds0 only; the regressions below use ds1
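The claim that standardizing does not change the fit can be checked on synthetic data (least squares via NumPy, not the insurance data): R² is identical before and after the transform.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(30, 6, size=200)                 # a bmi-like covariate
y = 2.0 + 0.05 * x + rng.normal(0, 0.3, size=200)

def r_squared(x, y):
    # fit y = a + b*x by least squares and return R^2
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

x_std = (x - x.mean()) / x.std(ddof=1)          # same transform as centralize()
assert abs(r_squared(x, y) - r_squared(x_std, y)) < 1e-10
```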
First, we define the dependent variable y and the covariate matrix X for future convenience.
y = ds1.logcharges
X = ds1.drop(['logcharges'], axis = 1)
#Conduct the first regression!
regr_1 = OLS(y, add_constant(X)).fit()
def variance_inflation_factor(exog, exog_idx):
    # regress column exog_idx on the remaining columns; VIF = 1 / (1 - R^2)
    k_vars = exog.shape[1]
    x_i = exog.iloc[:, exog_idx]
    mask = np.arange(k_vars) != exog_idx
    x_noti = exog.iloc[:, mask]
    r_squared_i = OLS(x_i, x_noti).fit().rsquared
    return 1. / (1. - r_squared_i)
# We skip the constant column.
VIF = [variance_inflation_factor(add_constant(X), i) for i in range(1,X.shape[1]+1)]
print(VIF)
## [1.0168221490038107, 1.0089001621005733, 1.106629732428617, 1.004010642137024, 1.0120736649061481, 1.5262104138696806, 1.524748317955475, 1.5971027909244977]
The VIF of each column is fine: all of them are smaller than 5, indeed smaller than 2.
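For intuition, the VIF definition used above (1/(1 − R²) from regressing one covariate on the rest) reduces, in the two-covariate case, to 1/(1 − r²) where r is their correlation. A synthetic check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)              # correlated with x1

def vif(x_i, x_other):
    # regress x_i on [1, x_other]; VIF = 1 / (1 - R^2)
    X = np.column_stack([np.ones(len(x_other)), x_other])
    beta, *_ = np.linalg.lstsq(X, x_i, rcond=None)
    resid = x_i - X @ beta
    r2 = 1 - resid.var() / x_i.var()
    return 1.0 / (1.0 - r2)

r = np.corrcoef(x1, x2)[0, 1]
assert abs(vif(x1, x2) - 1 / (1 - r**2)) < 1e-8
```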
sns.distplot(regr_1.resid)
The residuals look approximately normal, which is good. Since the residuals are already near-normal, a Box-Cox transformation is not strictly necessary, but we can check one value of the parameter:
lmbda = 0.1  # Box-Cox parameter
regr_test = OLS((y**lmbda-1)/lmbda, add_constant(X)).fit()
sns.jointplot((y**lmbda-1)/lmbda, regr_test.resid)
sns.distplot(regr_test.resid)
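A sketch of the Box-Cox family used in the check above: the transform is (y^λ − 1)/λ, and its λ → 0 limit is log y, which is why the log model fits into this family.

```python
import numpy as np

def boxcox(y, lmbda):
    # Box-Cox transform; as lmbda -> 0 it approaches log(y)
    if lmbda == 0:
        return np.log(y)
    return (y**lmbda - 1) / lmbda

y = np.linspace(1.0, 10.0, 50)
# for tiny lmbda the transform is numerically indistinguishable from log
assert np.allclose(boxcox(y, 1e-8), np.log(y), atol=1e-6)
```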
If the regression works well, the response should be only weakly related to the residuals; under the Gaussian assumption, independence is what we would hope for.
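One caveat before reading the next plot: even for a correctly specified linear model, the raw residuals correlate with y itself, with corr(e, y) = √(1 − R²); it is the fitted values they are orthogonal to. A synthetic illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.normal(size=n)
y = x + rng.normal(size=n)                      # true R^2 around 0.5
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

# residuals are orthogonal to the fitted values ...
assert abs(np.corrcoef(X @ beta, e)[0, 1]) < 1e-8
# ... but correlate with y whenever R^2 < 1: corr(e, y) = sqrt(1 - R^2)
r2 = 1 - e.var() / y.var()
assert abs(np.corrcoef(y, e)[0, 1] - np.sqrt(1 - r2)) < 1e-8
```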
sns.jointplot(y, regr_1.resid)
The residual plot looks strange: y and the residuals appear strongly related, which suggests the linear specification is missing something. Since there is apparently some non-linearity in this problem, we will add non-linear covariates.
Next we use partial residual plots, which attempt to show how each covariate is related to the dependent variable after controlling for the effects of all other covariates.
f,ax = plt.subplots(1,2,figsize=(10,8))
# partial residual = a covariate's fitted contribution plus the overall residual
sns.regplot(X.bmi, regr_1.params.bmi * X.bmi + regr_1.resid, ax=ax[0])
sns.regplot(X.age, regr_1.params.age * X.age + regr_1.resid, ax=ax[1])
Partial residual plots show that both bmi and age could significantly affect the charges.
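The partial-residual construction above can be sketched on synthetic data: regressing the partial residual (a covariate's fitted contribution plus the residual) on its covariate recovers the fitted coefficient exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

partial = beta[1] * x1 + resid        # partial residual for x1
slope = np.polyfit(x1, partial, 1)[0] # slope of partial residual vs x1
assert abs(slope - beta[1]) < 1e-8    # recovers the fitted coefficient
```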
For this part, we try adding variables and dropping ones that turn out to be useless. The primary concern is to add variables until the residuals are relatively independent of y.
print(regr_1.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.768
## Model: OLS Adj. R-squared: 0.767
## Method: Least Squares F-statistic: 549.8
## Date: Sun, 02 Dec 2018 Prob (F-statistic): 0.00
## Time: 20:01:19 Log-Likelihood: -808.52
## No. Observations: 1338 AIC: 1635.
## Df Residuals: 1329 BIC: 1682.
## Df Model: 8
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 6.8262 0.076 90.168 0.000 6.678 6.975
## age 0.0346 0.001 39.655 0.000 0.033 0.036
## sex 0.0754 0.024 3.091 0.002 0.028 0.123
## bmi 0.0134 0.002 6.381 0.000 0.009 0.017
## children 0.1019 0.010 10.085 0.000 0.082 0.122
## smoker 1.5543 0.030 51.333 0.000 1.495 1.614
## northeast 0.1290 0.035 3.681 0.000 0.060 0.198
## northwest 0.0652 0.035 1.863 0.063 -0.003 0.134
## southeast -0.0282 0.034 -0.819 0.413 -0.096 0.039
## ==============================================================================
## Omnibus: 463.882 Durbin-Watson: 2.046
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1673.760
## Skew: 1.679 Prob(JB): 0.00
## Kurtosis: 7.330 Cond. No. 330.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
X_2 = X.copy()  # copy so the original design matrix is untouched
X_2['sm_bm'] = X_2.smoker * X_2.bmi  # smoker x bmi interaction
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
# adding the smoker x bmi interaction clearly improves the model
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.784
## Model: OLS Adj. R-squared: 0.782
## Method: Least Squares F-statistic: 534.0
## Date: Sun, 02 Dec 2018 Prob (F-statistic): 0.00
## Time: 20:01:19 Log-Likelihood: -762.05
## No. Observations: 1338 AIC: 1544.
## Df Residuals: 1328 BIC: 1596.
## Df Model: 9
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 7.1128 0.079 90.254 0.000 6.958 7.267
## age 0.0348 0.001 41.281 0.000 0.033 0.036
## sex 0.0871 0.024 3.688 0.000 0.041 0.133
## bmi 0.0034 0.002 1.502 0.133 -0.001 0.008
## children 0.1031 0.010 10.569 0.000 0.084 0.122
## smoker 0.1564 0.146 1.071 0.284 -0.130 0.443
## northeast 0.1375 0.034 4.062 0.000 0.071 0.204
## northwest 0.0664 0.034 1.964 0.050 8.84e-05 0.133
## southeast -0.0252 0.033 -0.757 0.449 -0.091 0.040
## sm_bm 0.0456 0.005 9.773 0.000 0.036 0.055
## ==============================================================================
## Omnibus: 520.045 Durbin-Watson: 2.036
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 2150.321
## Skew: 1.846 Prob(JB): 0.00
## Kurtosis: 7.994 Cond. No. 661.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
X_2['sm_ag'] = X_2.smoker*X_2.age
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
# adding the smoker x age interaction improves the model significantly
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.825
## Model: OLS Adj. R-squared: 0.824
## Method: Least Squares F-statistic: 625.0
## Date: Sun, 02 Dec 2018 Prob (F-statistic): 0.00
## Time: 20:01:19 Log-Likelihood: -620.29
## No. Observations: 1338 AIC: 1263.
## Df Residuals: 1327 BIC: 1320.
## Df Model: 10
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 6.8960 0.072 95.828 0.000 6.755 7.037
## age 0.0416 0.001 48.930 0.000 0.040 0.043
## sex 0.0856 0.021 4.030 0.000 0.044 0.127
## bmi 0.0012 0.002 0.563 0.573 -0.003 0.005
## children 0.1063 0.009 12.106 0.000 0.089 0.124
## smoker 1.2793 0.146 8.769 0.000 0.993 1.566
## northeast 0.1509 0.030 4.953 0.000 0.091 0.211
## northwest 0.0860 0.030 2.827 0.005 0.026 0.146
## southeast 0.0061 0.030 0.202 0.840 -0.053 0.065
## sm_bm 0.0510 0.004 12.130 0.000 0.043 0.059
## sm_ag -0.0334 0.002 -17.698 0.000 -0.037 -0.030
## ==============================================================================
## Omnibus: 860.664 Durbin-Watson: 1.998
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 8198.732
## Skew: 2.969 Prob(JB): 0.00
## Kurtosis: 13.574 Cond. No. 744.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Since we only have two continuous covariates, we can try raising each of them to an extra power.
X_2['bmi^1.5'] = X_2.bmi ** 1.5
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.826
## Model: OLS Adj. R-squared: 0.825
## Method: Least Squares F-statistic: 574.1
## Date: Sun, 02 Dec 2018 Prob (F-statistic): 0.00
## Time: 20:01:19 Log-Likelihood: -614.16
## No. Observations: 1338 AIC: 1252.
## Df Residuals: 1326 BIC: 1315.
## Df Model: 11
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 5.9730 0.274 21.823 0.000 5.436 6.510
## age 0.0414 0.001 48.940 0.000 0.040 0.043
## sex 0.0859 0.021 4.060 0.000 0.044 0.127
## bmi 0.0928 0.026 3.528 0.000 0.041 0.144
## children 0.1065 0.009 12.175 0.000 0.089 0.124
## smoker 1.2759 0.145 8.782 0.000 0.991 1.561
## northeast 0.1564 0.030 5.147 0.000 0.097 0.216
## northwest 0.0848 0.030 2.797 0.005 0.025 0.144
## southeast 0.0149 0.030 0.497 0.619 -0.044 0.074
## sm_bm 0.0513 0.004 12.243 0.000 0.043 0.060
## sm_ag -0.0335 0.002 -17.815 0.000 -0.037 -0.030
## bmi^1.5 -0.0110 0.003 -3.494 0.000 -0.017 -0.005
## ==============================================================================
## Omnibus: 861.038 Durbin-Watson: 1.995
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 8243.481
## Skew: 2.968 Prob(JB): 0.00
## Kurtosis: 13.612 Cond. No. 4.89e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 4.89e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
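The large condition number the summary warns about is expected: bmi and bmi^1.5 are nearly collinear over the observed range. A minimal illustration with synthetic bmi values (not the real data):

```python
import numpy as np

rng = np.random.default_rng(6)
bmi = rng.uniform(16, 50, size=500)
X1 = np.column_stack([np.ones(500), bmi])
X2 = np.column_stack([np.ones(500), bmi, bmi**1.5])

# over this range, bmi and bmi**1.5 are almost perfectly correlated ...
assert np.corrcoef(bmi, bmi**1.5)[0, 1] > 0.99
# ... so adding bmi**1.5 inflates the design matrix's condition number
assert np.linalg.cond(X2) > np.linalg.cond(X1)
```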
X_2['age^1.5'] = X_2.age ** 1.5
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.828
## Model: OLS Adj. R-squared: 0.827
## Method: Least Squares F-statistic: 532.2
## Date: Sun, 02 Dec 2018 Prob (F-statistic): 0.00
## Time: 20:01:19 Log-Likelihood: -607.51
## No. Observations: 1338 AIC: 1241.
## Df Residuals: 1325 BIC: 1309.
## Df Model: 12
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 5.5668 0.294 18.907 0.000 4.989 6.144
## age 0.0776 0.010 7.785 0.000 0.058 0.097
## sex 0.0855 0.021 4.060 0.000 0.044 0.127
## bmi 0.0919 0.026 3.508 0.000 0.041 0.143
## children 0.0966 0.009 10.583 0.000 0.079 0.114
## smoker 1.2704 0.145 8.784 0.000 0.987 1.554
## northeast 0.1566 0.030 5.179 0.000 0.097 0.216
## northwest 0.0858 0.030 2.844 0.005 0.027 0.145
## southeast 0.0148 0.030 0.497 0.619 -0.044 0.073
## sm_bm 0.0514 0.004 12.326 0.000 0.043 0.060
## sm_ag -0.0335 0.002 -17.881 0.000 -0.037 -0.030
## bmi^1.5 -0.0108 0.003 -3.466 0.001 -0.017 -0.005
## age^1.5 -0.0039 0.001 -3.639 0.000 -0.006 -0.002
## ==============================================================================
## Omnibus: 895.912 Durbin-Watson: 1.992
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 9255.812
## Skew: 3.102 Prob(JB): 0.00
## Kurtosis: 14.293 Cond. No. 9.48e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.48e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
y_2 = ds0.charges  # switch back to the raw (un-logged) charges as the response
regr_test = OLS(y_2, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.845
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 600.3
## Date: Sun, 02 Dec 2018 Prob (F-statistic): 0.00
## Time: 20:01:19 Log-Likelihood: -13232.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1325 BIC: 2.656e+04
## Df Model: 12
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7665.5804 3687.192 -2.079 0.038 -1.49e+04 -432.209
## age -320.3823 124.794 -2.567 0.010 -565.197 -75.567
## sex 509.3387 263.692 1.932 0.054 -7.960 1026.637
## bmi 1056.6806 328.082 3.221 0.001 413.063 1700.298
## children 678.7408 114.266 5.940 0.000 454.579 902.902
## smoker -2.029e+04 1811.135 -11.200 0.000 -2.38e+04 -1.67e+04
## northeast 1288.7634 378.774 3.402 0.001 545.701 2031.825
## northwest 616.4165 377.748 1.632 0.103 -124.634 1357.467
## southeast 122.5000 374.253 0.327 0.743 -611.693 856.693
## sm_bm 1444.5800 52.238 27.654 0.000 1342.101 1547.059
## sm_ag -3.7854 23.430 -0.162 0.872 -49.750 42.179
## bmi^1.5 -123.8640 39.080 -3.170 0.002 -200.529 -47.199
## age^1.5 62.3346 13.294 4.689 0.000 36.256 88.413
## ==============================================================================
## Omnibus: 737.144 Durbin-Watson: 2.068
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4743.381
## Skew: 2.581 Prob(JB): 0.00
## Kurtosis: 10.645 Cond. No. 9.48e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.48e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
Now that we have added all the covariates we reasonably can, we consider dropping some of them. From the summary, the region indicators do not look important.
X_3 = X_2.drop(columns = ['northwest', 'southeast'])
regr_3 = OLS(y_2, add_constant(X_3)).fit()
print(regr_3.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.844
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 719.5
## Date: Sun, 02 Dec 2018 Prob (F-statistic): 0.00
## Time: 20:01:19 Log-Likelihood: -13233.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1327 BIC: 2.655e+04
## Df Model: 10
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7462.5827 3670.004 -2.033 0.042 -1.47e+04 -262.941
## age -322.6577 124.831 -2.585 0.010 -567.546 -77.770
## sex 509.7468 263.786 1.932 0.054 -7.736 1027.230
## bmi 1075.2760 326.981 3.288 0.001 433.821 1716.732
## children 681.9706 114.227 5.970 0.000 457.886 906.056
## smoker -2.032e+04 1811.634 -11.219 0.000 -2.39e+04 -1.68e+04
## northeast 1036.4906 309.503 3.349 0.001 429.323 1643.658
## sm_bm 1444.4375 52.251 27.644 0.000 1341.934 1546.941
## sm_ag -3.0043 23.396 -0.128 0.898 -48.902 42.893
## bmi^1.5 -126.7912 38.882 -3.261 0.001 -203.068 -50.514
## age^1.5 62.5812 13.298 4.706 0.000 36.495 88.668
## ==============================================================================
## Omnibus: 740.495 Durbin-Watson: 2.062
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4801.892
## Skew: 2.592 Prob(JB): 0.00
## Kurtosis: 10.698 Cond. No. 9.43e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.43e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
The AIC and BIC actually get smaller.
Similarly, the smoker:age interaction (sm_ag) is not important either, so we drop it.
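The AIC comparisons used in this section can be sketched for Gaussian OLS, where AIC equals n·log(RSS/n) + 2k up to an additive constant (synthetic data; dropping a genuinely useful covariate raises the AIC sharply, while the 2k penalty discourages useless ones):

```python
import numpy as np

def aic_ols(y, X):
    # Gaussian OLS AIC up to a constant: n*log(RSS/n) + 2k
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(((y - X @ beta) ** 2).sum())
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1])
X_null = np.ones((n, 1))
# the useful covariate easily earns its 2-point penalty
assert aic_ols(y, X_full) < aic_ols(y, X_null)
```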
X_4 = X_3.drop(columns = ['sm_ag'])
regr_4 = OLS(y_2, add_constant(X_4)).fit()
print(regr_4.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.844
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 800.1
## Date: Sun, 02 Dec 2018 Prob (F-statistic): 0.00
## Time: 20:01:20 Log-Likelihood: -13233.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1328 BIC: 2.654e+04
## Df Model: 9
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7441.2915 3664.898 -2.030 0.043 -1.46e+04 -251.670
## age -323.1930 124.715 -2.591 0.010 -567.854 -78.532
## sex 509.8762 263.686 1.934 0.053 -7.411 1027.163
## bmi 1075.1073 326.857 3.289 0.001 433.895 1716.320
## children 681.6865 114.163 5.971 0.000 457.727 905.646
## smoker -2.043e+04 1630.461 -12.527 0.000 -2.36e+04 -1.72e+04
## northeast 1036.8028 309.378 3.351 0.001 429.879 1643.726
## sm_bm 1443.9491 52.093 27.719 0.000 1341.755 1546.143
## bmi^1.5 -126.7498 38.866 -3.261 0.001 -202.996 -50.504
## age^1.5 62.5735 13.292 4.707 0.000 36.497 88.650
## ==============================================================================
## Omnibus: 740.467 Durbin-Watson: 2.063
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4799.401
## Skew: 2.592 Prob(JB): 0.00
## Kurtosis: 10.695 Cond. No. 9.41e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.41e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
After the model selection step, we can look at the residuals:
sns.residplot(regr_4.fittedvalues, y_2)
This looks much more acceptable than before.
For now, the best model is: charges = const + age + sex + bmi + children + smoker + northeast + sm_bm + bmi^1.5 + age^1.5
Caution: the coefficient on smoker is negative, but do not read this as smoking being good for your health. Because smoker also enters the sm_bm (smoker x bmi) interaction and bmi is always fairly large, the harmful effect of smoking has been absorbed by the sm_bm term; the combined effect of smoking remains positive.
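This caution can be made concrete with the regr_4 coefficients reported in the summary above (smoker ≈ −2.043e4, sm_bm ≈ 1443.9; bmi in this dataset starts at roughly 16): the combined effect of smoking, smoker + sm_bm × bmi, is positive at every observed bmi.

```python
# coefficients copied from the regr_4 summary above
b_smoker = -2.043e4
b_sm_bm = 1443.9491

def smoking_effect(bmi):
    # total change in predicted charges from smoker = 0 -> 1 at a given bmi
    return b_smoker + b_sm_bm * bmi

# already positive at the low end of observed bmi, and growing with bmi
assert smoking_effect(16) > 0
assert smoking_effect(30) > smoking_effect(20) > 0
```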
Data Analysis: Insurance (R)
The following packages are used in the R version of the tutorial.
# libraries--------------------------------------------------------------------
library(faraway)
library(readr)
library(dplyr)
source('./group19.R')
# Load data--------------------------------------------------------------------
ds0 <- read_csv("insurance.csv")
names(ds0)
## [1] "age" "sex" "bmi" "children" "smoker" "region"
## [7] "charges"
#have a glimpse of the data
head(ds0)
## # A tibble: 6 x 7
## age sex bmi children smoker region charges
## <int> <chr> <dbl> <int> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
# Check whether there are any missing values-----------------------------------
colSums(is.na(ds0))
No missing values, so we're good.
# Recode: sex: female = 1, male = 0; smoker: yes = 1, no = 0
ds0$sex[ds0$sex == "male"]="0"
ds0$sex[ds0$sex == "female"]="1"
ds0$smoker[ds0$smoker == "no"]="0"
ds0$smoker[ds0$smoker == "yes"]="1"
# Examining the distribution of each variables---------------------------------
hist(ds0$age,xlab="Age", main="Distribution of Age")
hist(ds0$bmi,xlab="BMI", main="Distribution of BMI")
hist(ds0$children,xlab="Children", main="Distribution of Children")
hist(ds0$charges,xlab="Charges", main="Distribution of Charges")
# Take the log of charges because of its heavy tail----------------------------
ds0$logcharges <- log(ds0$charges+1)
hist(ds0$logcharges, breaks = 10)
First, we define the dependent variable y and covariates X. In this analysis, the response is logcharges and the predictors are age, sex, bmi, children, smoker, and region.
# Conduct the first regression!
fit0 <- lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region), data=ds0)
X <- model.matrix(fit0)[, -1]
# Check the VIFs: all are smaller than 5, indeed smaller than 2.
round(vif(X),2)
## age sex1
## 1.02 1.01
## bmi children
## 1.11 1.00
## smoker1 as.factor(region)northwest
## 1.01 1.52
## as.factor(region)southeast as.factor(region)southwest
## 1.65 1.53
hist(fit0$residuals, xlab="Residuals")
plot(fit0$res, xlab="Residuals")
abline(h=0) # the residuals look approximately normal, which is good
plot(fit0)
# Partial residual plots-------------------------------------------------------
# These attempt to show how a covariate is related to the dependent variable
# after controlling for the effects of all other covariates.
# The partial residual plots look acceptable.
fit <- lm(logcharges~ bmi, data=ds0)
plot(fit)
For this part, we try adding variables and dropping ones that turn out to be useless. The primary concern is to add variables until the residuals are relatively independent of y.
fit0 <- lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region), data=ds0)
summary(fit0)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region), data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07141 -0.19836 -0.04914 0.06601 2.16602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.9556815 0.0732494 94.959 < 2e-16 ***
## age 0.0345744 0.0008719 39.655 < 2e-16 ***
## sex1 0.0753873 0.0243966 3.090 0.002043 **
## bmi 0.0133740 0.0020956 6.382 2.41e-10 ***
## children 0.1018259 0.0100976 10.084 < 2e-16 ***
## smoker1 1.5541456 0.0302739 51.336 < 2e-16 ***
## as.factor(region)northwest -0.0637739 0.0348992 -1.827 0.067867 .
## as.factor(region)southeast -0.1571539 0.0350762 -4.480 8.09e-06 ***
## as.factor(region)southwest -0.1289211 0.0350206 -3.681 0.000241 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4442 on 1329 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7666
## F-statistic: 549.8 on 8 and 1329 DF, p-value: < 2.2e-16
AIC(fit0)
## [1] 1636.536
BIC(fit0)
## [1] 1688.526
fit1 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+smoker*bmi, data=ds0)
summary(fit1)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91486 -0.17764 -0.05144 0.04640 2.22282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2508346 0.0769477 94.231 < 2e-16 ***
## age 0.0347880 0.0008427 41.280 < 2e-16 ***
## sex1 0.0870349 0.0236027 3.688 0.000236 ***
## bmi 0.0034055 0.0022672 1.502 0.133309
## children 0.1031176 0.0097574 10.568 < 2e-16 ***
## smoker1 0.1562914 0.1459708 1.071 0.284497
## as.factor(region)northwest -0.0711167 0.0337288 -2.108 0.035176 *
## as.factor(region)southeast -0.1626838 0.0338962 -4.799 1.77e-06 ***
## as.factor(region)southwest -0.1374810 0.0338491 -4.062 5.16e-05 ***
## bmi:smoker1 0.0455727 0.0046624 9.775 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4292 on 1328 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.7821
## F-statistic: 534.1 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit1)
## [1] 1602.769
# which clearly improves the performance of the model
fit2 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+smoker*bmi+as.numeric(smoker)*age, data=ds0)
summary(fit2)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi + as.numeric(smoker) * age,
## data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65395 -0.15404 -0.06305 -0.01126 2.34914
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0475291 0.0701861 100.412 < 2e-16 ***
## age 0.0415643 0.0008495 48.928 < 2e-16 ***
## sex1 0.0855867 0.0212384 4.030 5.90e-05 ***
## bmi 0.0011518 0.0020440 0.563 0.5732
## children 0.1062970 0.0087818 12.104 < 2e-16 ***
## smoker1 1.2788788 0.1458651 8.768 < 2e-16 ***
## as.factor(region)northwest -0.0649078 0.0303520 -2.139 0.0327 *
## as.factor(region)southeast -0.1448205 0.0305173 -4.746 2.30e-06 ***
## as.factor(region)southwest -0.1508956 0.0304676 -4.953 8.26e-07 ***
## as.numeric(smoker) NA NA NA NA
## bmi:smoker1 0.0510336 0.0042067 12.132 < 2e-16 ***
## age:as.numeric(smoker) -0.0333924 0.0018870 -17.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3862 on 1327 degrees of freedom
## Multiple R-squared: 0.8249, Adjusted R-squared: 0.8235
## F-statistic: 625 on 10 and 1327 DF, p-value: < 2.2e-16
BIC(fit2)
## [1] 1326.493
# which improves the performance of the model significantly
Since we only have two continuous covariates, we can try to give them an extra power.
fit3 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+
          smoker*bmi+as.numeric(smoker)*age+I(bmi^1.5), data=ds0)
summary(fit3)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03798 -0.17402 -0.03470 0.05657 2.04224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.1763930 0.0720456 99.609 < 2e-16 ***
## age 0.0375651 0.0007998 46.967 < 2e-16 ***
## sex1 0.1190553 0.0224451 5.304 1.32e-07 ***
## bmi -0.0015026 0.0023231 -0.647 0.5179
## children 0.1225808 0.0093304 13.138 < 2e-16 ***
## smoker1 0.0513262 0.1383354 0.371 0.7107
## as.factor(region)northwest 0.0100724 0.0315101 0.320 0.7493
## as.factor(region)southeast -0.0649199 0.0337726 -1.922 0.0548 .
## as.factor(region)southwest -0.1401404 0.0330701 -4.238 2.41e-05 ***
## bmi:smoker1 0.0495246 0.0045698 10.837 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4012 on 1328 degrees of freedom
## Multiple R-squared: 0.8077, Adjusted R-squared: 0.8064
## F-statistic: 619.8 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit3)
## [1] 1422.136
fit4 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+
          smoker*bmi+as.numeric(smoker)*age+I(bmi^1.5)+I(age^1.5), data=ds0)
summary(fit4)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5 + age^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91636 -0.18144 -0.05810 0.02661 2.13811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2645976 0.0841646 86.314 < 2e-16 ***
## age 0.0347150 0.0008966 38.720 < 2e-16 ***
## sex1 0.1085825 0.0249587 4.350 1.46e-05 ***
## bmi 0.0027220 0.0024320 1.119 0.2632
## children 0.0792971 0.0107489 7.377 2.84e-13 ***
## smoker1 0.1785370 0.1539089 1.160 0.2463
## as.factor(region)northwest -0.0821657 0.0375473 -2.188 0.0288 *
## as.factor(region)southeast -0.0423219 0.0348873 -1.213 0.2253
## as.factor(region)southwest -0.0817111 0.0350003 -2.335 0.0197 *
## bmi:smoker1 0.0419282 0.0048774 8.596 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4515 on 1328 degrees of freedom
## Multiple R-squared: 0.7465, Adjusted R-squared: 0.7447
## F-statistic: 434.4 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit4)
## [1] 1738.278
fit5 <-lm(charges ~age+sex+bmi+children+smoker+as.factor(region)+
          smoker*bmi+as.numeric(smoker)*age+I(bmi^1.5)+I(age^1.5), data=ds0)
summary(fit5)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5 + age^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9501.1 -2050.8 -1444.9 -458.8 24540.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1730.677 911.048 -1.900 0.057695 .
## age 257.277 9.705 26.510 < 2e-16 ***
## sex1 1003.023 270.168 3.713 0.000214 ***
## bmi -3.801 26.325 -0.144 0.885205
## children 238.749 116.353 2.052 0.040372 *
## smoker1 -20194.774 1666.003 -12.122 < 2e-16 ***
## as.factor(region)northwest -921.164 406.435 -2.266 0.023584 *
## as.factor(region)southeast -335.442 377.641 -0.888 0.374564
## as.factor(region)southwest -641.983 378.864 -1.694 0.090406 .
## bmi:smoker1 1422.834 52.796 26.949 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4887 on 1328 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.8235
## F-statistic: 694.2 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit5)
## [1] 26597.19
Caution: this is not a typical model-selection procedure. But if we remove the log, the model's performance certainly gets better (note that AIC/BIC values are not directly comparable across different responses).
Now that we have added all the covariates we can, we consider dropping some of them. From the summary, the region parameters do not look important, so we only keep northeast from the region column.
new_ds0 <- ds0[which(ds0$region=="northeast"),]
#Recode: sex: female = 1, male = 0; smoker: yes = 1, no = 0
new_ds0$sex[new_ds0$sex == "male"]="0"
new_ds0$sex[new_ds0$sex == "female"]="1"
new_ds0$smoker[new_ds0$smoker == "no"]="0"
new_ds0$smoker[new_ds0$smoker == "yes"]="1"
# Now fit the model on the subset where region is northeast.
fit6 <- lm(charges ~age+sex+bmi+children+smoker+smoker*bmi+
as.numeric(smoker)*age+I(bmi^1.5)+I(age^1.5), data=new_ds0)
summary(fit6)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## smoker * bmi + as.numeric(smoker) * age + I(bmi^1.5) + I(age^1.5),
## data = new_ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9406.3 -2174.0 -1538.2 -518.7 21804.4
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5132.11 8108.96 -0.633 0.52726
## age -99.62 266.41 -0.374 0.70870
## sex1 792.23 563.31 1.406 0.16060
## bmi 577.85 748.29 0.772 0.44056
## children 716.01 247.08 2.898 0.00402 **
## smoker1 -18454.30 3635.64 -5.076 6.62e-07 ***
## as.numeric(smoker) NA NA NA NA
## I(bmi^1.5) -58.07 91.07 -0.638 0.52417
## I(age^1.5) 36.39 28.46 1.279 0.20201
## bmi:smoker1 1444.18 115.75 12.477 < 2e-16 ***
## age:as.numeric(smoker) -45.95 52.15 -0.881 0.37901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4997 on 314 degrees of freedom
## Multiple R-squared: 0.8084, Adjusted R-squared: 0.8029
## F-statistic: 147.2 on 9 and 314 DF, p-value: < 2.2e-16
AIC(fit6)
## [1] 6450.074
BIC(fit6)
## [1] 6491.662
The AIC and BIC actually get smaller.
Similarly, the smoker*age interaction is not important either, so we drop it.
fit7 <- lm(charges ~age+sex+bmi+children+smoker+smoker*bmi+region+I(bmi^1.5)+I(age^1.5),data=new_ds0)
summary(fit7)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## smoker * bmi + region + I(bmi^1.5) + I(age^1.5), data = new_ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11786.7 -1716.1 -1351.6 -665.2 30838.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7441.29 3664.90 -2.030 0.042513 *
## age -323.19 124.72 -2.591 0.009662 **
## sex1 509.88 263.69 1.934 0.053369 .
## bmi 1075.11 326.86 3.289 0.001031 **
## children 681.69 114.16 5.971 3.02e-09 ***
## smoker1 -20425.25 1630.46 -12.527 < 2e-16 ***
## region1 1036.80 309.38 3.351 0.000827 ***
## I(bmi^1.5) -126.75 38.87 -3.261 0.001138 **
## I(age^1.5) 62.57 13.29 4.707 2.77e-06 ***
## bmi:smoker1 1443.95 52.09 27.719 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4795 on 1328 degrees of freedom
## Multiple R-squared: 0.8443, Adjusted R-squared: 0.8432
## F-statistic: 800.1 on 9 and 1328 DF, p-value: < 2.2e-16
AIC(fit7)
## [1] 26488.94
BIC(fit7)
## [1] 26546.13
After dropping smoker:age, the AIC and BIC stay essentially the same. For now, the best model is: charges = constant + age + sex + bmi + children + smoker + region + bmi:smoker + bmi^1.5 + age^1.5
Caution: as you may see, the coefficient on smoker1 is negative, but do not read this as smoking being good for your health. Because smoker also enters the bmi:smoker interaction and bmi is always fairly large, the harmful effect of smoking has been absorbed by the bmi:smoker term; the combined effect of smoking remains positive.
(Yijia)